Java Cluster readme and help file

For further information contact: Martin.Irman@mail.mcgill.ca (if you would like to use the below-described software – feel free to contact me)

 

Java Cluster is a cluster environment written in Java for automating the execution of Matlab simulations on multiple computers (currently used on multiple Windows XP computers). It does not run a single Matlab script in parallel (that would be nice, wouldn’t it :) ) but lets the user schedule hundreds of runs of a single Matlab script automatically across multiple computers. The environment supports multiple parallel “jobs” (a “job” is a Matlab script that should be executed x times), and the executions are scheduled according to job priorities (just, priority and idle scheduling are possible). The environment also provides tools to combine the data generated in the multiple executions.

 

Automatic execution of jobs other than Matlab simulations would be possible with some modification to the cluster software.

 

This documentation file contains these parts:

  1. Cluster Environment
  2. Installing a client on your machine
  3. Installing full cluster software on a machine
  4. Using the Cluster
  5. Example of a script
  6. Troubleshooting
  7. Release notes

 

Do not move or rename this document. Use MsWord to edit.

Cluster Environment

The cluster environment consists of 3 components:

 

To be able to manage simulations from your computer you only have to run the ClusterClient. You need to set it up to connect to the computer running the ArbiterServer (this can be the same computer).

Installing a client on your machine

Installing full cluster software on a machine

(JobServer, ArbiterServer, ClusterClient)

 

Things that have to be set-up:

 

(you might try to use the setup scripts in the install_scripts folder - but

these are matched to a specific computer configuration)

 

1. Review cluster.properties:

      * The matlab path needs to be set

      * Comment out the STANDBY property if you don't want your computers

        to go to standby automatically

 

2. If you have set up your JobServer computers to go to stand-by automatically,

   don't forget to set up Windows so that the computer automatically wakes up

   on network activity

 

3. Review install.bat. The batch file includes system specific information.

      * Change the account information for the "schtasks" commands

 

4. Add the java_cluster\matlab folder to the Matlab path!!!

 

5. Run install.bat to set-up automatic start of JobServer and/or

   ArbiterServer on start-up of the computer.

 

6. Alternatively, you can set up the automatic start manually without running install.bat

 

7. If e-mail notification is desired, you have to set up ‘blat’ with e-mail account

   information. ‘blat’ is a command-line utility that can send e-mails. It is

   included in the ‘system’ folder (detailed documentation is also included). To

   set it up, run the following command from the command line:

 

      blat -install <mail_server> <sender_email> 3 25 cluster <login> <password>

 

   This only needs to be done once. At McGill, use the following to set this utility up:

 

      blat -install mailhost.mcgill.ca Martin.Irman@mail.mcgill.ca 3 25 cluster <DAS USERNAME> <DAS PASSWORD>

 

   Now, user e-mail addresses have to be added to cluster.properties. Create a line

   like the following for every user:

      Martin = Martin.Irman@mail.mcgill.ca

   If this line is missing from the cluster.properties file for a user,

   that user will not receive e-mail notification.

 

8. (not important) Review eject.bat and load.bat in the \system folder and correct

   the CD-ROM drive letter if you want to be able to command the CD to eject
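
Step 4 above can also be done from the Matlab prompt. The following is a minimal sketch, assuming the cluster software was unpacked to C:\java_cluster (a hypothetical location; substitute your own). It adds the matlab subfolder to the path for the current session and checks whether jCombine, which the Troubleshooting section says lives in that folder, is then visible.

```matlab
% Hypothetical install location; change this to wherever java_cluster lives.
clusterMatlabDir = 'C:\java_cluster\matlab';

if exist(clusterMatlabDir, 'dir')
    % Add the folder to the path for the current Matlab session.
    addpath(clusterMatlabDir);
    % jCombine is expected in this folder, so it should now be found.
    if exist('jCombine', 'file')
        disp('Cluster functions found on the Matlab path.');
    else
        disp('Folder added, but jCombine was not found in it.');
    end
else
    disp(['Folder not found: ' clusterMatlabDir]);
end
```

To keep the folder on the path across sessions, save the path from Matlab's Set Path dialog.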

Using the Cluster

Logging in:

 

How to run a simulation:

save results\matlab.mat

save results\matlab.mat a_variable b_variable

save results\ber.mat snr ber

Careful! The following does not work (Matlab does not like the leading backslash):

save \results\matlab.mat (DOES NOT WORK!)

These commands save the current workspace (or the specified variables) in the file matlab.mat (or ber.mat). Later, everything that was generated inside the results folder is moved back to the client, where it can be analyzed using the jCombine command.
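
When testing a script on your own machine before submitting it, the results folder may not exist yet. Below is a minimal sketch, assuming placeholder variables snr and ber (the values are made up for illustration), that creates the folder if needed and saves with a relative path using the function form of save.

```matlab
% Placeholder data for illustration only; a real script computes these.
snr = 0:2:10;
ber = 0.5 * exp(-snr / 2);

% Create the results folder if it is absent (assumption: this is only
% needed when running the script outside the cluster).
if ~exist('results', 'dir')
    mkdir('results');
end

% Relative path without a leading separator, equivalent to
% "save results\ber.mat snr ber" on Windows.
save(fullfile('results', 'ber.mat'), 'snr', 'ber');
```

The function form makes it easy to build the file name from variables, while the path stays relative to the current directory as the command form above requires.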

 

What is priority:

 

Controlling the cluster:

 

Users and package scheduling:

 

Email notification:

Martin = Martin.Irman@mail.mcgill.ca

  jSetCombine(strMatlabFile,strVariable [,strName])

 

  Call this function from a simulation .m file to set which variables should

  be combined automatically. Multiple jSetCombine statements can be

  included in the simulation script. The command has to be issued while the

  current directory is either the results folder or its parent folder.

 

  Sets which variables should be combined if these are not explicitly

  specified in jCombine. Specify a Matlab .mat file (strMatlabFile) and

  which variable to combine (strVariable). Optionally specify the name for

  the combined variable (strName) in the results. If not specified,

  strVariable is used as the name. strName has to be specified if strVariable

  uses wildcards, i.e. '*'.

 

  jSetCombine('matlab.mat','SNR');

  jSetCombine('matlab.mat','*','all');
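
Put together, the end of a simulation script might look like the sketch below. The variable names and values are placeholders, and the exist() guard around jSetCombine is an addition so that the script also runs on a machine where the cluster functions are not on the Matlab path.

```matlab
% Placeholder results; a real script would compute these.
SNR = 0:2:10;
BER = 0.5 * exp(-SNR / 2);

% Save into the results folder (created here only for local testing).
if ~exist('results', 'dir')
    mkdir('results');
end
save(fullfile('results', 'matlab.mat'), 'SNR', 'BER');

% Register the variables to combine. The current directory is the parent
% of the results folder, which is one of the two locations jSetCombine accepts.
if exist('jSetCombine', 'file')
    jSetCombine('matlab.mat', 'SNR');
    jSetCombine('matlab.mat', 'BER');
end
```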

 

Blacklisting computers:

 

% Blacklist computer if it does not have enough memory

% (physical memory in MB).

requiredMemory = 200;

 

if (jSystemMemory < requiredMemory)

    % Construct an error message

    strCause = ['  Not enough memory on the system!          present: ' num2str(jSystemMemory) ' < required: ' num2str(requiredMemory)];   

    % Blacklist the computer - this function terminates matlab execution!!!

    jBlacklist(strCause);

end;

 

 

if (strcmp(jSystemName,'cluster5'))

    jBlacklist;

end;

 

Example of a script

% This is an example of a script to be used in a cluster simulation

 

% Blacklist computer if it does not have enough memory

% (physical memory in MB).

requiredMemory = 600;

 

if (jSystemMemory < requiredMemory)

    % Construct an error message

    strCause = ['  Not enough memory on the system!          present: ' num2str(jSystemMemory) ' < required: ' num2str(requiredMemory)];   

    % Blacklist the computer - this function terminates matlab execution!!!

    jBlacklist(strCause);

end;

 

% Here would be the code of the simulation we want to execute (let's just

% generate a random matrix in this example).

 

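% Generate a random matrix. Seed rand (the generator used below) from the

% clock so that every run produces different numbers.

rand('state',sum(100*clock));

a = rand(2,2);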

% Let's wait a few seconds ... as if we would be doing something

pause(30);

 

% Save the workspace in the results folder - only files inside the

% results folder will be retrieved back to the cluster client

 

save results\matlab.mat

 

% You can use jCombine('rnd_num','matlab.mat','*') to retrieve the

% saved workspace or just jCombine('rnd_num','matlab.mat','a') to

% retrieve the variable 'a' from all the 'matlab.mat' files.

Troubleshooting

 

1. Matlab cannot find jCombine:

    * Add the \matlab subdirectory of your cluster data to the Matlab path!

 

2. Some computers have problems:

    * Reboot the problematic computers

    * Check if there is free space on the disk where the cluster data is

      located. If the computer is low on disk space, run ClusterClient and

      use the 'Purge' button in the 'Servers' window. This will clean up old

      data. Check if it helped.

 

3. Some computers are running a job that should have finished long ago but

   is not finishing:

    * Matlab does not exit if another instance of Matlab is running on the

      same machine and this instance is unresponsive (running simulations).

      As soon as that instance finishes, or you kill it manually, the

      Matlab launched by the cluster will also exit and the server will resume

      normal operation.

 

4. An error is reported by the cluster client: ‘error saving file’:

    * Make sure that you save your data in the results folder using the

      command ‘save results\matlab.mat’ and not ‘save \results\matlab.mat’.

      Matlab does not like the leading backslash.

 

5. Some computers are not good for running the simulations (they don’t have

   enough RAM and the swapping could damage the disk):

     * Blacklist the computers where you don’t want to run the simulation. If

       a script determines that a computer is not good enough for the simulation

       it can blacklist the computer from further executions of this script

       by calling jBlacklist (refer to the description above).

 

6. Blacklisting does not work, the computer is listed as blacklisted but the

   simulations start up on it anyway:

     * Make sure that the name of the computer in the cluster.properties file

       matches the name that appears in the blacklist exactly. If not, change

       the name in the cluster.properties file to match. For example: localhost:8188

       does not match 127.0.0.1:8188, alpha4:80 does not match ALPHA4:80, which does

       not match ALPHA4:0080!

 

7. ClusterClient or ArbiterServer is slow:

     * Delete unused packages

     * ‘Purge’ the cluster using the purge button in the ClusterClient servers

       view. Note: this will erase all information about packages older than 30 days!

     * Optimize the file system. The number of files in the ‘results’ folders might

       slow down the file system. You can speed up operation by disabling

       short filename generation with the following command:

 

       fsutil behavior set disable8dot3 1

 

       Note: some older applications that use short (8.3) filenames might stop working!

 

9. When I command the servers to eject the CD, the servers don't do that:

    * Is the correct drive letter specified in the batch files eject.bat

      and load.bat on the servers in the \system folder?

 

10. Computer goes to stand-by and never wakes up again:

    * The wake-on-LAN capability has to be enabled to wake up the computer after it was

      sent to stand-by by the arbiter. Try pinging the computer, connecting via remote

      desktop or otherwise disturbing the computer. If it does not come on, then there is

      some problem with wake-on-LAN.

    * Wake-on-LAN didn’t work with WinXP Service Pack 2 installed. Once we uninstalled it

      and installed SP1 only, the feature worked.

    * BIOS or Windows settings might be disabling this feature.

    * If nothing helps, disable the go-to-standby feature on the computer by commenting out

      the STANDBY property in cluster.properties.

 

11. Arbiter cannot connect to JobServers / ClusterClient cannot connect to Arbiter:

    * Check the Windows firewall settings. If necessary, open ports 8188, 8189, 8191.

 

12. ClusterClient shuts down when starting Matlab. An error message is generated by the

   Java Virtual Machine saying that an exception occurred in the AWT package:

    * We had an issue with an old ATI graphics card driver that caused this. Try installing

      the newest drivers (run Windows Update; in our case it helped).
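
For item 6, the effect of an exact comparison can be checked in Matlab. This sketch uses strcmp as a stand-in for whatever comparison the cluster performs internally (an assumption, not the cluster's actual code); the name pairs are the ones given in item 6.

```matlab
% Name pairs from troubleshooting item 6. Each pair may refer to the same
% machine, but the two spellings differ.
pairs = {'localhost:8188', '127.0.0.1:8188'; ...
         'alpha4:80',      'ALPHA4:80'; ...
         'ALPHA4:80',      'ALPHA4:0080'};

matches = false(size(pairs, 1), 1);
for k = 1:size(pairs, 1)
    % strcmp is case sensitive and compares every character.
    matches(k) = strcmp(pairs{k, 1}, pairs{k, 2});
    fprintf('%-15s vs %-15s : match = %d\n', pairs{k, 1}, pairs{k, 2}, matches(k));
end
```

None of the pairs match under an exact comparison, so the name in cluster.properties should be copied character for character from the blacklist entry.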

 

Release Notes

To do list: